tSNE!

t-distributed stochastic neighbor embedding (tSNE) is a nonlinear, nonparametric, and unsupervised dimension reduction machine learning algorithm. It is used to find patterns in high-dimensional data.

Recall that dimension reduction techniques such as PCA help us project high-dimensional data onto a reduced feature space, such as 2 or 3 main axes of “distilled” variation that can be efficiently visualized. PCA, however, captures only linear structure.

tSNE visualizations often look a little nicer than those from PCA because instead of preserving distances between observations, tSNE preserves the probabilities that observations are neighbors, minimizing the Kullback-Leibler divergence between the high- and low-dimensional neighbor distributions (its loss function). It also becomes difficult to say what PCA data separation looks like in higher-dimensional space, because extrapolating a lower-dimensional representation into higher dimensions can be dubious.
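The loss function just mentioned can be stated precisely: tSNE minimizes the Kullback-Leibler divergence between the high-dimensional neighbor probabilities p_ij and their low-dimensional counterparts q_ij, where the latter use a Student-t kernel (van der Maaten and Hinton 2008):

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```

The heavy tails of the t distribution in q_ij let moderately distant points spread out in the embedding, and are where the “t” in tSNE comes from.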

Some key hyperparameters include:

  • dims - the number of dimensions to be returned.
  • perplexity - essentially the number of nearest neighbors, but measured along the curved, surface-like manifold rather than by straight-line distances. It should be less than the number of observations, but it is not that simple…
  • theta - the Barnes-Hut speed/accuracy tradeoff, ranging from 0 to 1, with lower values giving slower but more accurate optimizations. 0.0 returns the exact tSNE solution (defaults to 0.5).
  • eta - learning rate.
  • check_duplicates - should duplicate observations be removed?
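Because perplexity is the hyperparameter people trip over most, it helps to know the hard limit: Rtsne refuses perplexity values that are too large for the sample size (roughly, 3 × perplexity must not exceed n − 1; check ?Rtsne for your installed version). A small sketch of that rule of thumb:

```r
# Approximate upper bound on perplexity for n observations
# (mirrors the check Rtsne performs before fitting)
max_perplexity <- function(n) (n - 1) / 3

max_perplexity(150)  # for iris: just under 50
```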

Package installation

Run these lines manually if you need to install or update the following packages:

if (FALSE) {
  install.packages(c(
    # train/test data splitting
    "caret",
    # Our sole ML algorithm this time around
    "randomForest",
    # tSNE algorithms
    "Rtsne", "tsne"
    )) 
}

Library the required packages

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(Rtsne)
library(tsne)

Load the iris dataset

data(iris)

# Learn about the data
?iris

# View its structure
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# How many of each species?
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50

Goals

We will fit one model using the tsne package and one using the Rtsne package. Then, we will use the Rtsne model to add coordinates to our dataset and to train and evaluate a random forest algorithm on these new data.

tsne package

Here, the help files outline a concise way to fit the tSNE algorithm, visualizing progress via a brief plotting callback function:

# Define colors for plotting
colors = rainbow(length(unique(iris$Species)))

# Assign one color to each species
names(colors) = unique(iris$Species)
colors
##      setosa  versicolor   virginica 
## "#FF0000FF" "#00FF00FF" "#0000FFFF"
# Define the function
ecb = function(x, y) {
  plot(x, t = 'n')
  text(x, labels = iris$Species, col = colors[iris$Species])
}

# Fit
set.seed(1)
system.time({
tsne_iris = tsne::tsne(iris[, -5], epoch_callback = ecb, perplexity = 50)
})
## sigma summary: Min. : 0.565012665854053 |1st Qu. : 0.681985646004023 |Median : 0.713004330336136 |Mean : 0.716213420895748 |3rd Qu. : 0.74581655363904 |Max. : 0.874979764925049 |
## Epoch: Iteration #100 error is: 12.5419603996613
## Epoch: Iteration #200 error is: 0.255642913624415
## Epoch: Iteration #300 error is: 0.243735702264651
## Epoch: Iteration #400 error is: 0.24370348684716
## Epoch: Iteration #500 error is: 0.243703479565549
## Epoch: Iteration #600 error is: 0.243703479562828
## Epoch: Iteration #700 error is: 0.243703479562827
## Epoch: Iteration #800 error is: 0.243703479562828
## Epoch: Iteration #900 error is: 0.243703479562827
## Epoch: Iteration #1000 error is: 0.243703479562827

##    user  system elapsed 
##  14.804   1.273  17.382

Rtsne example

Rtsne provides clearer hyperparameters, better help files, and more flexibility than the tsne package.

# Rtsne checks for duplicate observations by default, since computing
# distances between identical points can distort the neighbor probabilities;
# disable the check only if duplicates are acceptable for your data.

set.seed(1)
Rtsne_iris <- Rtsne::Rtsne(as.matrix(iris[, -5]), 
                    # Return just the first two dimensions
                    dims = 2,
                    # Let's set perplexity to 5% of the number of rows
                    # Try setting it to a larger value as well, like 25%
                    perplexity = nrow(iris) * 0.05,
                    # try changing theta to 0.0 to see what happens
                    theta = 0.5, 
                    # change eta to 0 and see what happens!
                    eta = 1, 
                    # Tell the algorithm it is okay to have duplicate rows
                    check_duplicates = F) 
# Unpack!
names(Rtsne_iris)
##  [1] "theta"               "perplexity"          "N"                   "origD"               "Y"                   "costs"              
##  [7] "itercosts"           "stop_lying_iter"     "mom_switch_iter"     "momentum"            "final_momentum"      "eta"                
## [13] "exaggeration_factor"
# Plot first two dimensions
plot(Rtsne_iris$Y[, 1:2], col = iris$Species) 

Visual comparison to PCA

pca_iris = princomp(iris[,1:4])$scores[,1:2]
plot(pca_iris, t = 'n')
text(pca_iris, labels = iris$Species, col = colors[iris$Species])
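Note that princomp() above uses the covariance matrix, so features measured on larger scales dominate the components. A scaled (correlation-based) alternative via prcomp() is worth comparing; this sketch plots the first two standardized components:

```r
# Standardize each feature before rotation, then plot the first two PCs
data(iris)
pca_scaled <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
plot(pca_scaled$x[, 1:2], col = iris$Species,
     xlab = "PC1 (scaled)", ylab = "PC2 (scaled)")
```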

A machine learning example

Let’s recapitulate Mark Borg’s walkthrough here, continuing with our Rtsne_iris model from above. We will cbind the tSNE coordinates into our dataset and then train and evaluate a random forest on the augmented data.

# Add tSNE coordinates via cbind
data = cbind(iris, Rtsne_iris$Y)

# Rename the new columns
colnames(data)[6] = "tSNE_Dim1"
colnames(data)[7] = "tSNE_Dim2"

# Check out the dataset
head(data)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species tSNE_Dim1  tSNE_Dim2
## 1          5.1         3.5          1.4         0.2  setosa  9.111711 -3.6912105
## 2          4.9         3.0          1.4         0.2  setosa 14.571546  3.1283754
## 3          4.7         3.2          1.3         0.2  setosa 16.083600 -0.2395849
## 4          4.6         3.1          1.5         0.2  setosa 15.695034  0.6848413
## 5          5.0         3.6          1.4         0.2  setosa 10.379982 -4.4955670
## 6          5.4         3.9          1.7         0.4  setosa  8.042246 -9.6696125
# Split the data
set.seed(1)
split = caret::createDataPartition(data$Species, p = 0.75, list = FALSE)
training_set = data[split,]
test_set = data[-split,]

# Identify species "target" variable and predictors for train and test sets
X_train = training_set[, -5]
Y_train = training_set$Species

X_test = test_set[, -5]
Y_test = test_set$Species

Fit the random forest:

set.seed(1)
RF = randomForest(x = X_train, y = Y_train,
                  xtest = X_test, ytest = Y_test,
                  ntree = 500, 
                  proximity = T,
                  importance = T,
                  keep.forest = T,
                  do.trace = T)
predicted = predict(RF, X_test)
table(predicted, Y_test)
##             Y_test
## predicted    setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         12         1
##   virginica       0          0        11
mean(predicted == Y_test)
## [1] 0.9722222
varImpPlot(RF)
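caret can summarize the same predictions far more richly than table(), adding accuracy, kappa, and per-class sensitivity and specificity. With the model above you would call caret::confusionMatrix(predicted, Y_test); the self-contained sketch below instead reconstructs the 36 test-set predictions implied by the confusion table above, purely for illustration:

```r
library(caret)  # assumes caret is installed, as above

# Rebuild the predictions from the table above: everything correct
# except one virginica flower predicted as versicolor
lv <- c("setosa", "versicolor", "virginica")
truth <- factor(rep(lv, each = 12), levels = lv)
pred <- truth
pred[which(truth == "virginica")[1]] <- "versicolor"

caret::confusionMatrix(data = pred, reference = truth)
```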

Resources

tSNE FAQ. Laurens van der Maaten blog.

Cao, Y and L Wang. 2017. Automatic selection of t-SNE perplexity. Journal of Machine Learning Research: Workshop and Conference Proceedings 1:1-7.

Linderman, GC and S. Steinerberger. 2017. Clustering with t-SNE, provably. arXiv:1706.02582 [cs.LG].

Pezzotti et al. 2017. Approximated and user steerable tSNE for progressive visual analytics. IEEE Transactions on Visualization and Computer Graphics 23:1739-1752.

Schubert E. and M. Gertz. 2017. Intrinsic t-stochastic neighbor embedding for visualization and outlier detection: A remedy against the curse of dimensionality? In: Beecks C., Borutta F., Kröger P., Seidl T. (eds) Similarity Search and Applications (SISAP). Lecture Notes in Computer Science, Springer, 10609:188-203.

Wattenberg et al. 2016. How to use t-SNE effectively. Distill.

colah’s blog. 2015. Visualizing representations: Deep learning and human beings.

Wang W et al. 2015. On deep multi-view representation learning. Journal of Machine Learning Research: Workshop and Conference Proceedings 37.

van der Maaten, LJP. 2014. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research, 15:3221-3245.

Hamel, P and D. Eck. 2010. Learning features from music audio with deep belief networks. 11th International Society for Music Information Retrieval Conference 339-344.

Jamieson AR et al. 2010. Exploring nonlinear feature space dimension reduction and data representation in breast CADx with Laplacian eigenmaps and t-SNE. Medical Physics 37:339-351.

van der Maaten, LJP. 2009. Learning a Parametric Embedding by Preserving Local Structure. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), Journal of Machine Learning Research Workshop and Conference Proceedings 5:384-391.

van der Maaten LJP and GE Hinton. 2008. Visualizing Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605.

Also check out umapr and uwot.